Inferring the success parameter p of a binomial 
model from small samples affected by background 

G. D'Agostini 
Abstract 

The problem of inferring the binomial parameter p from x successes 
obtained in n trials is reviewed and extended to take into account the 
presence of background, that can affect the data in two ways: a) fake 
successes are due to a background modeled as a Poisson process of 
known intensity; b) fake trials are due to a background modeled as a 
Poisson process of known intensity, each trial being characterized by a 
known success probability $p_b$. 

1 Introduction 

An important class of experiments consists in counting 'objects'. In fact, we 
are often interested in measuring their density in time, space, or both (here 
'density' stands for a general term, that in the domain of time is equivalent 
to 'rate') or the proportion of those objects that have a certain character 
in common. For example, particle physicists might be interested in cross 
sections and branching ratios, astronomers in density of galaxies in a region 
of the sky or in the ratio of galaxies exhibiting some special features. 

A well known problem in counting experiments is that we are rarely 
in the ideal situation of being able to count individually and at a given 
time all the objects of interest. More often we have to rely on a sample of 
them. Other problems that occur in real environments, especially in frontier 
research, are detector inefficiency and presence of background: sometimes 
we lose objects in counting; other times we might be confused by other 
objects that do not belong to the classes we are looking for, though they are 
observationally indistinguishable from the objects of interest. 

We focus here on the effect of background in measurements of proportions.
For an extensive treatment of the effect of background on rates, i.e.
measuring the intensity of a Poisson process in presence of background, see
Ref. [1], as well as chapters 7 and 13 of Ref. [2].

The paper is structured as follows. In section 2 we introduce the 'direct'
and 'inverse' probabilistic problems related to the binomial distribution and
the two cases of background that will be considered. In section 3 we go
through the standard text-book case in which background is absent, but
we discuss also, in some depth, the issue of how prior knowledge does or
does not influence the probabilistic conclusions. Then, in the following two
sections we come to the specific issue of this paper, and finally the paper
ends with the customary short conclusions.

2 The binomial distribution and its inverse problem

An important class of counting experiments can be modeled as independent 
Bernoulli trials. In each trial we believe that a success will occur with 
probability p, and a failure with probability q = 1 — p. If we consider n 
independent trials, all with the same probability p, we might be interested 
in the total number of successes, independently of their order. The total 
number of successes X can range between 0 and n, and our belief on the 
outcome X = x can be evaluated from the probability of each success and 
some combinatorics. The result is the well known binomial distribution, 
hereafter indicated with $\mathcal{B}_{n,p}$:

$f(x \mid \mathcal{B}_{n,p}) = \frac{n!}{(n-x)!\,x!}\; p^x\, (1-p)^{n-x}$,   (1)

having expected value and standard deviation

$\mathrm{E}(X) = n\,p$   (2)
$\sigma(X) = \sqrt{n\,p\,(1-p)}$.   (3)

We associate the formal quantities expected value and standard deviation
to the concepts of (probabilistic) prevision and standard uncertainty.

The binomial distribution describes what is sometimes called a direct
probability problem, i.e. calculate the probability of the experimental outcome
X (the effect) given n and an assumed value of p. The inverse problem
is what mostly concerns scientists: infer p given n and x. In probabilistic
terms, we are interested in $f(p \mid n, x)$. Probability inversions are performed,
within probability theory, using Bayes theorem, which in this case reads

$f(p \mid x, n, \mathcal{B}) \propto f(x \mid \mathcal{B}_{n,p}) \cdot f_0(p)$,   (4)

where $f_0(p)$ is the prior, $f(p \mid x, n, \mathcal{B})$ the posterior (or final) and $f(x \mid \mathcal{B}_{n,p})$
the likelihood. The proportionality factor is calculated from normalization.
[Note the use of f(·) for the several probability functions as well as probability
density functions (pdf), also within the same formula.] The solution of
Eq. (4), related to the names of Bayes and Laplace, is presently a kind of first
text book exercise in the so called Bayesian inference (see e.g. Ref. [?]).
The issue of priors in this kind of problems will be discussed in detail in
Sec. 3.1, especially for the critical cases of x = 0 and x = n.

The problem can be complicated by the presence of background. This 
is the main subject of this paper, and we shall focus on two kinds of back- 
ground. 

a) Background can only affect x. Think, for example, of a person
shooting n times at a target and counting, at the end, the number
of scores x in order to evaluate his efficiency. If somebody else fires
by mistake at random on his target, the number x will be affected by
background. The same situation can happen in measuring efficiencies
in those situations (for example due to high rate or loose timing) in
which the time correlation between the equivalents of 'shooting' and
'scoring' cannot be established on an event-by-event basis (think, for example,
of neutron or photon detectors).

The problem will be solved assuming that the background is described
by a Poisson process of well known intensity $r_b$, that corresponds to
a well known expected value $\lambda_b$ of the resulting Poisson distribution
(in the time domain $\lambda_b = r_b \cdot T$, where T is the measuring time). In other
words, the observed x is the sum of two contributions: $x_s$ due to the
signal, binomially distributed with $\mathcal{B}_{n,p}$, plus $x_b$ due to background,
Poisson distributed with parameter $\lambda_b$, indicated by $\mathcal{P}_{\lambda_b}$.

For large numbers (and still relatively low background) the problem
is easy to solve: we subtract the expected number of background events and
calculate the proportion $\hat{p} = (x - \lambda_b)/n$. For small numbers, the
'estimator' $\hat{p}$ can become smaller than 0 or larger than 1. And, even if
$\hat{p}$ comes out in the correct range, it is still affected by large uncertainty.
Therefore we have to go through a rigorous probability inversion, that
in this case is given by

$f(p \mid n, x, \lambda_b) \propto f(x = x_s + x_b \mid n, p, \lambda_b) \cdot f_0(p)$,   (5)

where we have written explicitly in the likelihood that x is due to
the sum of two (individually unobservable!) contributions $x_s$ and $x_b$
(hereafter the subscripts s and b stand for signal and background).

b) The background can show up, at random, as independent
'fake' trials, all with the same probability $p_b$ of producing successes. An
example, that has indeed prompted this paper, is that of measuring
the proportion of blue galaxies in a small region of sky where there
are galaxies belonging to a cluster, as well as background galaxies, the
average proportion of blue galaxies of which is well known. In this
case both n and x have two contributions:

$n = n_s + n_b$   (6)
$x = x_s + x_b$   (7)

with

$n_b \sim \mathcal{P}_{\lambda_b}$   (8)
$x_b \sim \mathcal{B}_{n_b,p_b}$   (9)
$x_s \sim \mathcal{B}_{n_s,p_s}$,   (10)

where '~' stands for 'follows a given distribution'.

Again, the trivial large number (and not too large background) solution
is the proportion of background subtracted numbers,
$\hat{p}_s = (x - p_b\,\lambda_b)/(n - \lambda_b)$. But in the most general case we need to infer $p_s$ from

$f(p_s \mid n, x, \lambda_b, p_b) \propto f(x = x_s + x_b \mid n = n_s + n_b, p_s, p_b, \lambda_b) \cdot f_0(p_s)$.   (11)

We might also be interested in other questions, like e.g. how many
of the n objects are due to the signal, i.e.

$f(n_s \mid n, x, \lambda_b, p_b)$.

Indeed, the general problem lies in the joint inference

$f(n_s, p_s \mid n, x, \lambda_b, p_b)$,

from which we can get other information, like the conditional distribution
of $p_s$ for any given number of events attributed to signal:

$f(p_s \mid n, n_s, x, \lambda_b, p_b)$.

Finally, we may also be interested in the rate $r_s$ of the signal objects,
responsible for the $n_s$ signal objects in the sample (or, equivalently, in
the Poisson distribution parameter $\lambda_s$):

$f(\lambda_s \mid n, x, \lambda_b, p_b)$.

Figure 1: Probability density function of the binomial parameter p, having
observed x successes in n trials [2].



3 Inferring p in absence of background 

The solution of Eq. (4) depends, at least in principle, on the assumption
on the prior $f_0(p)$. Taking a flat prior between 0 and 1, that models our
indifference on the possible values of p before we take into account the result
of the experiment in which x successes were observed in n trials, we get (see
e.g. [2]):

$f(p \mid x, n, \mathcal{B}) = \frac{(n+1)!}{x!\,(n-x)!}\; p^x\, (1-p)^{n-x}$,   (12)

some examples of which are shown in Fig. 1. Expected value, mode (the
value of p for which f(p) has the maximum) and variance of this distribution
are:



$\mathrm{E}(p) = \frac{x+1}{n+2}$   (13)
$\mathrm{mode}(p) = p_m = x/n$   (14)
$\sigma^2(p) = \mathrm{Var}(p) = \frac{(x+1)\,(n-x+1)}{(n+3)\,(n+2)^2}$   (15)
$\qquad = \mathrm{E}(p)\,(1-\mathrm{E}(p))\;\frac{1}{n+3}$.   (16)



Eq. (13) is known as "recursive Laplace formula", or "Laplace's rule of
succession". Note that there is no magic if the formula gives a sensible result
even for the extreme cases x = 0 and x = n for all values of n (even if n = 0!).
It is just a consequence of the prior: in absence of new information, we get
out what we put in!

From Fig. 1 we can see that for large numbers (and with x far from 0
and from n) f(p) tends to a Gaussian. This is just the reflection of the limit to
Gaussian of the binomial. In this large numbers limit $\mathrm{E}(p) \approx p_m = x/n$ and
$\sigma(p) \approx \sqrt{x/n\,(1 - x/n)/n}$.

3.1 Meaning and role of the prior: many data limit versus 
frontier type measurements 

One might worry about the role of the prior. Indeed, in some special cases
of importance ('frontier type' measurements) one has to. However, in most
routine cases, the prior just plays the role of a logical tool to allow probability
inversion, but it is in fact absorbed in the normalization constant. (See
extensive discussions in Ref. [?] and references therein.)

In order to see the effect of the prior, let us model it in an easy and
powerful way using a beta distribution, a very flexible tool to describe many
situations of prior knowledge about a variable defined in the interval between
0 and 1 (see Fig. 2). The beta distribution is the conjugate prior of the
binomial distribution, i.e. prior and posterior belong to the same function
family, with parameters updated by the data via the likelihood. In fact, a
generic beta distribution as a function of the variable p is given by

$f(p \mid \mathrm{Beta}(r,s)) = \frac{1}{\beta(r,s)}\; p^{r-1}\, (1-p)^{s-1}$,   $r, s > 0$;  $0 \le p \le 1$.   (17)

The denominator is just for normalization and, indeed, the integral
$\beta(r,s) = \int_0^1 p^{r-1}\,(1-p)^{s-1}\,dp$ defines the special function beta that names the
distribution. We immediately recognize Eq. (12) as a beta distribution of
parameters r = x + 1 and s = n − x + 1 [and the fact that $\beta(r,s)$ is equal to
(r − 1)!(s − 1)!/(s + r − 1)! for integer arguments].

For a generic beta we get the following posterior (neglecting the irrelevant
normalization factor):

$f(p \mid n, x, \mathrm{Beta}(r_i,s_i)) \propto [p^x\,(1-p)^{n-x}] \times [p^{r_i-1}\,(1-p)^{s_i-1}]$   (18)
$\qquad \propto p^{x+r_i-1}\,(1-p)^{n-x+s_i-1}$,   (19)

where the subscript i stands for initial, synonym of prior. We can then see
that the final distribution is still a beta with parameters $r_f = r_i + x$ and
$s_f = s_i + (n - x)$: the first parameter is updated by the number of successes,
the second parameter by the number of failures.

Figure 2: Examples of Beta distributions for some values of r and s [?]. The
parameters in bold refer to continuous curves.

Expected value, mode and variance of the generic beta of parameters r
and s are:

$\mathrm{E}(X) = \frac{r}{r+s}$   (20)
$\mathrm{mode}(X) = (r-1)/(r+s-2)$   [r > 1 and s > 1]   (21)
$\mathrm{Var}(X) = \frac{r\,s}{(r+s+1)\,(r+s)^2}$   [r + s > 1].   (22)

Then we can use these formulae for the beta posterior of parameters $r_f$ and
$s_f$.
The use of the conjugate prior in this problem demonstrates in a clear
way how the inference becomes progressively independent from the prior
information in the limit of a large amount of data: this happens when both
$x \gg r_i$ and $n - x \gg s_i$. In this limit we get the same result we would
get from a flat prior ($r_i = s_i = 1$, see Fig. [?]). For this reason in standard
'routine' situations, we can quietly and safely take a flat prior.
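
As an illustration that is not part of the original analysis, the conjugate updating rule (prior Beta(r_i, s_i), posterior Beta(r_i + x, s_i + n − x)) and the beta summaries of Eqs. (20)-(22) can be sketched in a few lines of Python. The function names `beta_update` and `beta_summary` are our own illustrative choices.

```python
from math import sqrt

def beta_update(r_i, s_i, x, n):
    """Posterior beta parameters after observing x successes in n trials."""
    return r_i + x, s_i + (n - x)

def beta_summary(r, s):
    """Expected value, mode (None when not defined) and standard deviation of Beta(r, s)."""
    mean = r / (r + s)
    mode = (r - 1) / (r + s - 2) if (r > 1 and s > 1) else None
    var = r * s / ((r + s + 1) * (r + s) ** 2)
    return mean, mode, sqrt(var)

# Example: x = 7 successes in n = 10 trials, with two different priors.
for r_i, s_i in [(1, 1), (2, 2)]:
    r_f, s_f = beta_update(r_i, s_i, x=7, n=10)
    mean, mode, sd = beta_summary(r_f, s_f)
    print(f"prior Beta({r_i},{s_i}) -> posterior Beta({r_f},{s_f}): "
          f"E = {mean:.2f}, mode = {mode:.2f}, sigma = {sd:.2f}")
```

With the flat prior the posterior is Beta(8, 4), whose expected value is 2/3 and whose mode is x/n = 0.7, as given by Eqs. (13) and (14).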

Instead, the treatment needs much more care in situations typical of
'frontier research': small numbers, and often with not a single 'success'. Let
us consider the latter case and let us assume a naive flat prior, that is
considered to represent 'indifference' of the parameter p between 0 and 1.
From Eq. (12) we get

$f(p \mid x = 0, n, \mathrm{Beta}(1,1)) = (n+1)\,(1-p)^n$.   (23)

(The prior has been written explicitly among the conditions of the posterior.)
Some examples are given in Fig. 3. As n increases, p is more and more
constrained in proximity of 0. In these cases we are used to giving upper limits
at a certain level of confidence. The natural meaning that we give to this
expression is that we are such and such percent confident that p is below the
reported upper limit. In the Bayesian approach this is straightforward, for
confidence and probability are synonyms. For example, if we want to give
the limit that makes us 95% sure that p is below it, i.e. $P(p \le p_{u_{0.95}}) = 0.95$,
then we have to calculate the value $p_{u_{0.95}}$ such that the cumulative function
$F(p_{u_{0.95}})$ is equal to 0.95:

$F(p_{u_{0.95}} \mid x = 0, n, \mathrm{Beta}(1,1)) = \int_0^{p_{u_{0.95}}} (n+1)\,(1-p)^n\,dp$   (24)
$\qquad = 1 - (1 - p_{u_{0.95}})^{n+1} = 0.95$,   (25)

that yields

$p_{u_{0.95}} = 1 - \sqrt[n+1]{0.05}$.   (26)

Figure 3: Probability density function of the binomial parameter p, having
observed no successes in n trials [2].

For the three examples given in Fig. 3, with n = 3, 10 and 50, we have
$p_{u_{0.95}}$ = 0.53, 0.24 and 0.057, respectively. These results are in order, as
long as the flat prior reflected our expectations about p, that it could be about
equally likely in any sub-interval of fixed width in the interval between 0
and 1 (and, for example, we believe that it is equally likely below 0.5 and
above 0.5).
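
The quoted limits can be checked numerically. For x = 0 and a flat prior the posterior is (n + 1)(1 − p)^n, whose cumulative function 1 − (1 − p)^(n+1) can be inverted in closed form. The short Python sketch below does this; the function name `upper_limit_no_success` and its `prob` argument are illustrative assumptions.

```python
def upper_limit_no_success(n, prob=0.95):
    """Value p_u such that P(p <= p_u) = prob, for x = 0 successes in n trials
    and a flat prior: solves 1 - (1 - p_u)**(n + 1) = prob."""
    return 1.0 - (1.0 - prob) ** (1.0 / (n + 1))

for n in (3, 10, 50):
    p95 = upper_limit_no_success(n)             # 95% probability upper limit
    p50 = upper_limit_no_success(n, prob=0.5)   # value splitting the probability in two halves
    print(f"n = {n:2d}:  p_u(0.95) = {p95:.3f},  p_u(0.5) = {p50:.3f}")
```

The same function evaluated with `prob=0.5` gives the 50% limit, which is a useful sanity check of whether the flat prior really represents one's beliefs.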

However, this is often not the case in frontier research. Perhaps we were
looking for a very rare process, with a very small p. Therefore, having done
only 50 trials, we cannot say to be 95% sure that p is below 0.057. In
fact, by logic, the previous statement implies that we are 5% sure that p is
above 0.057, and this might seem too much for the scientist expert of the
phenomenology under study. (Never ask mathematicians about priors! Ask
yourselves and the colleagues you believe are the most knowledgeable experts
of what you are studying.) In general I suggest making the exercise of
calculating a 50% upper or lower limit, i.e. the value that divides the possible
values in two equiprobable regions: we are as confident that p is above as it
is below $p_{u_{0.5}}$. For n = 50 we have $p_{u_{0.5}}$ = 0.013. If a physicist was looking
for a rare process, he/she would be highly embarrassed to report to be 50%
confident that p is above 0.013. But he/she should be equally embarrassed to
report to be 95% confident that p is below 0.057, because both statements
are logical consequences of the same result, that is Eq. (23). If this is the
case, a better grounded prior is needed, instead of just a 'default' uniform.
For example one might think that several orders of magnitude in the small p
range are considered equally possible. This gives rise to a prior that is uniform
in ln p (within a range $\ln p_{min}$ and $\ln p_{max}$), equivalent to $f_0(p) \propto 1/p$ with
lower and upper cut-offs.

Figure 4: Rescaled likelihoods for x = 0 and some values of n.

Anyway, instead of playing blindly with mathematics, looking around
for 'objective' priors, or priors that come from abstract arguments, it is
important to understand at once the role of prior and likelihood. Priors are
logically important to make a 'probability inversion' via the Bayes formula,
and it is a matter of fact that no other route to probabilistic inference exists.
The task of the likelihood is to modify our beliefs, distorting the pdf that
models them. Let us plot the three likelihoods of the three cases of Fig. 3,
rescaled to the asymptotic value p → 0 (constant factors are irrelevant in
likelihoods). It is preferable to plot them in a log scale along the abscissa
to remember that several orders of magnitude are involved (Fig. 4).

We see from the figure that in the high p region the beliefs expressed by
the prior are strongly damped. If we were convinced that p was in that region
we have to dramatically review our beliefs. With the increasing number of
trials, the region of 'excluded' values of log p increases too.

Instead, for very small values of p, the likelihood becomes flat, i.e. equal
to the asymptotic value p → 0. The region of flat likelihood represents the
values of p for which the experiment loses sensitivity: if scientifically motivated
priors concentrate the probability mass in that region, then the experiment
is irrelevant to change our convictions about p.

Formally the rescaled likelihood

$\mathcal{R}(p; n, x = 0) = \frac{f(x = 0 \mid n, p)}{f(x = 0 \mid n, p \to 0)}$,   (27)

equal to $(1-p)^n$ in this case, is a function that gives the Bayes factor of a
generic p with respect to the reference point p = 0 for which the experimental
sensitivity is certainly lost. Using the Bayes formula, $\mathcal{R}(p; n, x = 0)$ can be
rewritten as

$\mathcal{R}(p; n, x = 0) = \frac{f(p \mid n, x = 0)}{f_0(p)} \Big/ \frac{f(p = 0 \mid n, x = 0)}{f_0(p = 0)}$,   (28)

to show that it can be interpreted as a relative belief updating factor, in the
sense that it gives the updating factor for each value of p with respect to
that at the asymptotic value p → 0.

We see that this $\mathcal{R}$ function gives a way to report an upper limit that
does not depend on the prior: it can be any conventional value in the region
of transition from $\mathcal{R} = 1$ to $\mathcal{R} = 0$. However, this limit cannot have a
probabilistic meaning, because it does not depend on the prior. It is instead a
sensitivity bound, roughly separating the excluded high p values from the
small p values about which the experiment has nothing to say.^
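
To make the idea of a sensitivity bound concrete, the following Python sketch tabulates the rescaled likelihood for x = 0, which is simply (1 − p)^n, on a logarithmic grid of p, and solves for the p at which it drops to a conventional level. The level 0.05 and the function names are purely illustrative assumptions; the text prescribes no particular convention.

```python
def rescaled_likelihood(p, n):
    """R(p; n, x = 0) = f(x=0 | n, p) / f(x=0 | n, p -> 0) = (1 - p)**n."""
    return (1.0 - p) ** n

def sensitivity_bound(n, level=0.05):
    """Value of p at which R falls to `level` (a conventional, non-probabilistic bound)."""
    return 1.0 - level ** (1.0 / n)

log10_p = [-4, -3, -2, -1]          # several orders of magnitude in p
for n in (3, 10, 50):
    row = ", ".join(f"{rescaled_likelihood(10.0 ** e, n):.3f}" for e in log10_p)
    print(f"n = {n:2d}: R at p = 1e-4..1e-1 -> {row};  "
          f"bound (R = 0.05) at p = {sensitivity_bound(n):.3f}")
```

For small p the values stay close to 1, the region where the experiment has lost sensitivity, while for large p they drop towards 0.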

For further discussion about the role of prior in frontier research, applied
to the Poisson process, see Ref. [?]. For examples of experimental results
provided with the $\mathcal{R}$ function, see Refs. [?].

4 Poisson background on the observed number of 
'successes' 

Imagine now that the x successes might contain an unknown number of
background events $x_b$, of which we only know their expected value $\lambda_b$,
estimated somehow and about which we are quite sure (i.e. uncertainty about
$\lambda_b$ is initially neglected; it will be indicated at the end of the section how
to handle it). We make the assumption that the background events come
at random and are described by a Poisson process of intensity $r_b$, such that
the Poisson parameter $\lambda_b$ is equal to $r_b \times \Delta T$ in the domain of time, with
$\Delta T$ the observation time. (But we could as well reason in other domains,
like objects per unit of length, surface, volume, or solid angle. The
density/intensity parameter r will have different dimensions depending on the
context, while $\lambda$ will always be dimensionless.)

^"Wovon man nicht reden kann, darüber muss man schweigen" (L. Wittgenstein) ["Whereof one cannot speak, thereof one must be silent"].

The number of observed successes x has now two contributions:

$x = x_s + x_b$   (29)
$x_s \sim \mathcal{B}_{n,p}$   (30)
$x_b \sim \mathcal{P}_{\lambda_b}$.   (31)

In order to use Bayes theorem we need to calculate $f(x \mid n, p, \lambda_b)$, that is
$f(x = x_s + x_b \mid \mathcal{B}_{n,p}, \mathcal{P}_{\lambda_b})$, i.e. the probability function of the sum of a
binomial variable and a Poisson variable. The combined probability function
is given by (see e.g. section 4.4 of Ref. [5]):

$f(x \mid \mathcal{B}_{n,p}, \mathcal{P}_{\lambda_b}) = \sum_{x_s, x_b} \delta_{x,\,x_s+x_b}\; f(x_s \mid \mathcal{B}_{n,p})\; f(x_b \mid \mathcal{P}_{\lambda_b})$,   (32)

where $\delta_{x,\,x_s+x_b}$ is the Kronecker delta that constrains the possible values of
$x_s$ and $x_b$ in the sum ($x_s$ and $x_b$ run from 0 to the maximum allowed by the
constraint). Note that we do not need to calculate this probability function
for all x, but only for the number of actually observed successes.
The inferential result about p is finally given by

$f(p \mid x, n, \lambda_b) \propto f(x \mid \mathcal{B}_{n,p}, \mathcal{P}_{\lambda_b})\; f_0(p)$.   (33)
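
The inference just described, a binomial signal plus a Poisson background summed under the constraint x = x_s + x_b and then inverted with Bayes theorem, can be sketched numerically as follows. This is only a minimal illustration on a grid of p values, not the code used to produce the figures; all function names, the midpoint grid and the default uniform prior are our assumptions.

```python
from math import comb, exp, factorial

def binom_pmf(k, n, p):
    """Binomial probability of k successes in n trials."""
    return comb(n, k) * p**k * (1.0 - p)**(n - k)

def poisson_pmf(k, lam):
    """Poisson probability of k events with expected value lam."""
    return exp(-lam) * lam**k / factorial(k)

def likelihood(x, n, p, lam_b):
    """f(x | B_{n,p}, P_{lam_b}): sum over all splits x = x_s + x_b."""
    return sum(binom_pmf(x_s, n, p) * poisson_pmf(x - x_s, lam_b)
               for x_s in range(min(x, n) + 1))

def posterior_on_grid(x, n, lam_b, prior=lambda p: 1.0, steps=1000):
    """Normalized posterior pdf of p on a midpoint grid over (0, 1)."""
    dp = 1.0 / steps
    grid = [(i + 0.5) * dp for i in range(steps)]
    weights = [likelihood(x, n, p, lam_b) * prior(p) for p in grid]
    norm = sum(weights) * dp
    return grid, [w / norm for w in weights]

def mean_of(grid, pdf):
    """Expected value of p from a pdf tabulated on a uniform grid."""
    dp = grid[1] - grid[0]
    return sum(p * f for p, f in zip(grid, pdf)) * dp

for lam_b in (0.0, 2.0, 8.0):
    grid, pdf = posterior_on_grid(x=7, n=10, lam_b=lam_b)
    print(f"lambda_b = {lam_b:4.1f}   E(p) = {mean_of(grid, pdf):.3f}")
```

A different prior can be passed through the `prior` argument, for instance `prior=lambda p: p * (1 - p)` for an unnormalized Beta(2, 2); the normalization is taken care of on the grid.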

An example is shown in Fig. 5 for n = 10, x = 7 and an expected number
of background events ranging between 0 and 10, as described in the figure
caption. The upper plot of the figure is obtained with a uniform prior (priors
are represented with dashed lines in this figure). As an exercise, let us also
show in the lower plot of the figure the results obtained using a broad prior
still centered at p = 0.5, but that excludes the extreme values 0 and 1, as
is often the case in practice. This kind of prior has been modeled
here with a beta function of parameters $r_i = 2$ and $s_i = 2$.

For the cases of expected background different from zero we have also
evaluated the $\mathcal{R}$ function, defined in analogy to Eq. (27) as
$\mathcal{R}(p; n, x, \lambda_b) = f(x \mid n, p, \lambda_b) / f(x \mid n, p \to 0, \lambda_b)$. Note that, while Eq. (27) is only defined
for x = 0, since a single observation makes p = 0 impossible, that limitation
does not hold any longer in the case of non-null expected background. In
fact, it is important to remember that, as soon as we have background,
there is some chance that all observed events are due to it (remember that a
Poisson variable is defined for all non-negative integers!). This is essentially
the reason why in this case the likelihoods tend to a positive value for p → 0
(I like to call this kind of likelihoods 'open' [1]). As discussed above, the
power of the data to update the beliefs on p is self-evident in a log-plot.

Figure 6: Relative belief updating factor of p for n = 10, x = 7 and several
hypotheses of background: $\lambda_b$ = 1, 2, 4, 6, 8, 10.

We see in Fig. 6 that, essentially, the data do not provide any relevant
information for values of p below 0.01.

Let us also see what happens when the prior concentrates our beliefs at
small values of p, though in principle allowing all values from 0 to 1. Such
a prior can be modeled with a log-normal distribution of suitable parameters
(−4 and 1), i.e. $f_0(p) = \exp[-(\log p + 4)^2/2] / (\sqrt{2\pi}\, p)$, with an upper cut-off
at p = 1 (the probability that such a distribution gives a value above 1
is $3.2 \times 10^{-5}$). Expected value and standard deviation of Lognormal(−4, 1) are
0.03 and 0.04, respectively. The result is given in Fig. 7, where the prior is
indicated with a dashed line.

We see that, with increasing expected background, the posteriors are
essentially equal to the prior. Instead, in case of null background, ten trials
are already sufficient to dramatically change our prior beliefs. For example,
initially there was 4.5% probability that p was above 0.1. Finally there is
only 0.09% probability for p to be below 0.1.

The case of null background is also shown in Fig. 8, where the results
of the three different priors are compared. We see that passing from a
Beta(1, 1) to a Beta(2, 2) makes little change in the conclusion. Instead,
a log-normal prior distribution peaked at low values of p changes quite a
lot the shape of the distribution, but not really the substance of the result
(expected value and standard deviation for the three cases are: 0.67, 0.13;
0.64, 0.12; 0.49, 0.16). Anyway, the prior does its job correctly and there
should be no wonder that the final pdf drifts somehow to the left side, to
take into account a prior knowledge according to which 7 successes in 10
trials was really a 'surprising event'.

Figure 7: Inference of p for n = 10, x = 7, assuming a log-normal prior
(dashed line) peaked at low p, and with several hypotheses of background
($\lambda_b$ = 0, 1, 2, 4, 6, 8, 10).

Figure 8: Inference of p for n = 10, x = 7 in absence of background, with three
different priors.

Figure 9: Sequential inference of p, starting from a prior peaked at low values,
given two experiments, each with n = 10 and x = 7.

Those who share such a prior need more solid data to be convinced that
p could be much larger than what they initially believed. Let us make the
exercise of looking at what happens if a second experiment gives exactly
the same outcome (x = 7 with n = 10). The Bayes formula is applied
sequentially, i.e. the posterior of the first inference becomes the prior of the
second inference. That is equivalent to multiplying the two likelihoods (we assume
conditional independence of the two observations). The results are given
in Fig. 9. (By the way, the final result is equivalent to having observed 14
successes in 20 trials, as it should be: the correct updating property is one
of the intrinsic nice features of the Bayesian approach.)
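
The equivalence between sequential updating and a single update with the pooled data can be verified numerically. The sketch below uses a log-normal prior with parameters −4 and 1, tabulated on a grid (so the cut-off at p = 1 is implicit), with no background; the grid size and the function names are illustrative assumptions.

```python
from math import exp, log, pi, sqrt

def lognormal_prior(p, mu=-4.0, sigma=1.0):
    """Log-normal pdf in p, used here as a prior peaked at small p."""
    return exp(-(log(p) - mu) ** 2 / (2.0 * sigma ** 2)) / (sqrt(2.0 * pi) * sigma * p)

def update(grid, prior_values, x, n):
    """One Bayesian update with a binomial likelihood (no background).
    The binomial coefficient is a constant factor and drops out in the normalization."""
    dp = grid[1] - grid[0]
    weights = [f * p**x * (1.0 - p)**(n - x) for p, f in zip(grid, prior_values)]
    norm = sum(weights) * dp
    return [w / norm for w in weights]

steps = 2000
grid = [(i + 0.5) / steps for i in range(steps)]   # midpoints in (0, 1)
prior = [lognormal_prior(p) for p in grid]

first = update(grid, prior, x=7, n=10)     # first experiment
second = update(grid, first, x=7, n=10)    # second experiment, prior = first posterior
pooled = update(grid, prior, x=14, n=20)   # both experiments at once

max_diff = max(abs(a - b) for a, b in zip(second, pooled))
print(f"largest difference between sequential and pooled posterior: {max_diff:.2e}")
```

The difference is at the level of floating-point rounding, as expected from the multiplication of the two likelihoods.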

4.1 Uncertainty on the expected background 

In these examples we made the assumption that the expected number of
background events is well known. If this is not the case, we can quantify our
uncertainty about it by a pdf $f(\lambda_b)$, whose modeling depends on our best
knowledge about $\lambda_b$. Taking account of this uncertainty in a probabilistic
approach is rather simple, at least conceptually (calculations can be quite
complicated, but this is a different question). In fact, applying probability
theory we get:

$f(p \mid x, n) = \int_0^\infty f(p \mid x, n, \lambda_b)\; f(\lambda_b)\; d\lambda_b$.   (34)

We recognize in this formula that the pdf that takes into account all possible
values of $\lambda_b$ is a weighted average of all $\lambda_b$-dependent pdf's, with a weight
equal to $f(\lambda_b)$.
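
A discretized version of this weighted average is easy to sketch. In the Python fragment below the integral over the expected background is replaced by a sum over a few hypothetical values with hypothetical weights (the numbers 1, 2, 3 and 0.25, 0.5, 0.25 are illustrative assumptions, as are the function names); each conditional posterior is computed with a uniform prior on p.

```python
from math import comb, exp, factorial

def likelihood(x, n, p, lam_b):
    """Binomial signal plus Poisson background, summed over x = x_s + x_b."""
    total = 0.0
    for x_s in range(min(x, n) + 1):
        x_b = x - x_s
        total += (comb(n, x_s) * p**x_s * (1.0 - p)**(n - x_s)
                  * exp(-lam_b) * lam_b**x_b / factorial(x_b))
    return total

def conditional_posterior(x, n, lam_b, grid):
    """f(p | x, n, lam_b) on a uniform grid, with a uniform prior on p."""
    dp = grid[1] - grid[0]
    weights = [likelihood(x, n, p, lam_b) for p in grid]
    norm = sum(weights) * dp
    return [w / norm for w in weights]

def marginal_posterior(x, n, lam_values, lam_weights, grid):
    """Average of the conditional posteriors, weighted by the (discretized) f(lam_b)."""
    total_weight = sum(lam_weights)
    mix = [0.0] * len(grid)
    for lam_b, weight in zip(lam_values, lam_weights):
        cond = conditional_posterior(x, n, lam_b, grid)
        mix = [m + (weight / total_weight) * c for m, c in zip(mix, cond)]
    return mix

grid = [(i + 0.5) / 500 for i in range(500)]
lam_values = [1.0, 2.0, 3.0]       # hypothetical values of the expected background
lam_weights = [0.25, 0.50, 0.25]   # hypothetical weights f(lam_b)
pdf = marginal_posterior(7, 10, lam_values, lam_weights, grid)
print(f"normalization check: {sum(pdf) / 500:.6f}")
```

A finer set of background values, with weights taken from a continuous pdf, would approximate the integral more closely.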

5 Poisson background on the observed number of 
'trials' and of 'successes' 

Let us now move to problem b) of the introduction. Again, we consider
only the case in which the background parameters are well known, and refer to the previous
subsection for treating their uncertainty. To summarize, this is what we
assume to know with certainty:

n : the total observed number of 'objects', $n_s$ of which are due to signal and
$n_b$ to background; but these two numbers are not directly observable
and can only be inferred;

x : the total observed number of 'objects' of the subclass of interest,
sum of the unobservable $x_s$ and $x_b$;

$\lambda_b$ : the expected number of background objects;

$p_b$ : the expected proportion of successes due to the background events.

As we discussed in the introduction, we are interested in inferring the number
of signal objects $n_s$, as well as the parameter $p_s$ of the 'signal'. We need then
to build a likelihood that connects the observed numbers to all quantities
we want to infer. Therefore we need to calculate the probability function
$f(x \mid n, n_s, p_s, \lambda_b, p_b)$.

Let us first calculate the probability function $f(x \mid n_s, p_s, n_b, p_b)$ that depends
on the unobservable $n_s$ and $n_b$. This is the probability function of the
sum of two binomial variables:

$f_{2B}(x \mid n_s, p_s, n_b, p_b) = \sum_{x_s, x_b} \delta_{x,\,x_s+x_b}\; f(x_s \mid \mathcal{B}_{n_s,p_s}) \cdot f(x_b \mid \mathcal{B}_{n_b,p_b})$,   (35)

where $x_s$ ranges between 0 and $n_s$, and $x_b$ ranges between 0 and $n_b$. x can
vary between 0 and $n_s + n_b$, has expected value $\mathrm{E}(x) = n_s p_s + n_b p_b$ and
variance $\mathrm{Var}(x) = n_s p_s (1 - p_s) + n_b p_b (1 - p_b)$. As for Eq. (32), we need
to evaluate Eq. (35) only for the observed number of successes. Contrary
to the implicit convention within this paper to use the same symbol f(·)
meaning different probability functions and pdf's, we name Eq. (35) $f_{2B}$ for
later convenience.

In order to obtain the general likelihood we need, two observations are
in order:

• Since x depends on $\lambda_b$ only via $n_b$, $f(x \mid n_s, p_s, n_b, p_b, \lambda_b)$ is equal
to $f_{2B}(x \mid n_s, p_s, n_b, p_b)$.

• The likelihood that depends also on n can be obtained from
$f(x \mid n_s, p_s, n_b, p_b, \lambda_b)$ by the following reasoning:

  - if $n = n_s + n_b$, then

    $f(x \mid n, n_s, p_s, n_b, p_b, \lambda_b) = f(x \mid n_s, p_s, n_b, p_b, \lambda_b)$;

  - else

    $f(x \mid n, n_s, p_s, n_b, p_b, \lambda_b) = 0$.

It follows that

$f(x \mid n, n_s, p_s, n_b, p_b, \lambda_b) = f(x \mid n_s, p_s, n_b, p_b, \lambda_b)\; \delta_{n,\,n_s+n_b}$   (36)
$\qquad = f_{2B}(x \mid n_s, p_s, n_b, p_b)\; \delta_{n,\,n_s+n_b}$.   (37)

At this point we get rid of $n_b$ in the conditions, taking into account its possible
values and their probabilities, given $\lambda_b$:

$f(x \mid n, n_s, p_s, p_b, \lambda_b) = \sum_{n_b} f(x \mid n, n_s, p_s, n_b, p_b, \lambda_b)\; f(n_b \mid \mathcal{P}_{\lambda_b})$,   (38)

i.e.

$f(x \mid n, n_s, p_s, p_b, \lambda_b) = \sum_{n_b} f_{2B}(x \mid n_s, p_s, n_b, p_b)\; f(n_b \mid \mathcal{P}_{\lambda_b})\; \delta_{n,\,n_s+n_b}$,   (39)

where $n_b$ ranges between 0 and n, due to the $\delta_{n,\,n_s+n_b}$ condition. Finally, we
can use Eq. (39) in Bayes theorem to infer $n_s$ and $p_s$:

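$f(n_s, p_s \mid x, n, \lambda_b, p_b) \propto f(x \mid n, n_s, p_s, p_b, \lambda_b)\; f_0(n_s, p_s)$   (40)
$f(p_s \mid x, n, \lambda_b, p_b) = \sum_{n_s} f(n_s, p_s \mid x, n, \lambda_b, p_b)$   (41)
$f(n_s \mid x, n, \lambda_b, p_b) = \int f(n_s, p_s \mid x, n, \lambda_b, p_b)\; dp_s$   (42)
$f(p_s \mid x, n, n_s, \lambda_b, p_b) = \frac{f(n_s, p_s \mid x, n, \lambda_b, p_b)}{f(n_s \mid x, n, \lambda_b, p_b)}$.   (43)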

We give now some numerical examples. For simplicity (and because we
are not thinking of a specific physical case) we take uniform priors, i.e.
$f_0(n_s, p_s)$ = const. We refer to section 3.1 for an extensive discussion on
priors and on critical 'frontier' cases.
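
Before looking at the results, it may help to see how the joint inference can be organized in practice. The Python sketch below is a minimal illustration under uniform priors, not the code used for the figures: for each hypothesis n_s the number of background objects is fixed to n − n_s by the Kronecker delta, the likelihood is the two-binomial sum weighted by the Poisson probability of n − n_s background objects, and the two marginals are obtained by summing over n_s or over a grid in p_s. All names and the grid size are illustrative assumptions.

```python
from math import comb, exp, factorial

def binom_pmf(k, n, p):
    return comb(n, k) * p**k * (1.0 - p)**(n - k)

def poisson_pmf(k, lam):
    return exp(-lam) * lam**k / factorial(k)

def f_2b(x, n_s, p_s, n_b, p_b):
    """Probability of x successes as the sum of two binomial variables."""
    return sum(binom_pmf(x_s, n_s, p_s) * binom_pmf(x - x_s, n_b, p_b)
               for x_s in range(max(0, x - n_b), min(x, n_s) + 1))

def joint_posterior(x, n, lam_b, p_b, steps=200):
    """Joint posterior of (n_s, p_s) with uniform priors, on a midpoint grid in p_s."""
    dp = 1.0 / steps
    grid = [(i + 0.5) * dp for i in range(steps)]
    joint = {}
    for n_s in range(n + 1):
        n_b = n - n_s                       # enforced by the Kronecker delta
        weight = poisson_pmf(n_b, lam_b)    # probability of n_b background objects
        joint[n_s] = [weight * f_2b(x, n_s, p_s, n_b, p_b) for p_s in grid]
    norm = sum(sum(row) for row in joint.values()) * dp
    for n_s in joint:
        joint[n_s] = [v / norm for v in joint[n_s]]
    return grid, joint

def marginals(grid, joint):
    """Marginal pdf of p_s (sum over n_s) and probabilities of n_s (integral over p_s)."""
    dp = grid[1] - grid[0]
    f_ps = [sum(joint[n_s][i] for n_s in joint) for i in range(len(grid))]
    f_ns = {n_s: sum(row) * dp for n_s, row in joint.items()}
    return f_ps, f_ns

grid, joint = joint_posterior(x=9, n=12, lam_b=4.0, p_b=0.75)
f_ps, f_ns = marginals(grid, joint)
print("P(n_s):", {n_s: round(prob, 3) for n_s, prob in f_ns.items()})
```

The conditional pdf of p_s for a given n_s is obtained by dividing the corresponding row of the joint table by the probability of that n_s.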






5.1 Inferring $p_s$

If priors are uniform then Eq. (41) becomes

$f(p_s \mid x, n, \lambda_b, p_b) \propto \sum_{n_s, n_b} f_{2B}(x \mid n_s, p_s, n_b, p_b)\; f(n_b \mid \mathcal{P}_{\lambda_b})\; \delta_{n,\,n_s+n_b}$.   (44)

Figure 10 gives the result for x = 9, n = 12, and assuming several hypotheses
for $\lambda_b$ and $p_b$.

• The upper plot is for $p_b$ = 0.75, equal to x/n. The curves are for
$\lambda_b$ = 0, 1, 2, 4, 6, 8, 10, 12 and 14, with the order indicated (whenever
possible) in the figure. If the expected background is null, we recover
the simple result we already know. As the expected background increases,
$f(p_s)$ gets broader, because the inference is based on a smaller
number of objects attributed to the signal and because we are uncertain
about the number of events actually due to the background. In very
noisy environments ($\lambda_b \approx n$, or even larger), the data provide very little
information about $p_s$ and, essentially, the prior pdf (dashed curve)
is recovered. Note also that for all values of $\lambda_b$ the posterior $f(p_s)$ is
peaked at x/n = 0.75. This is due to the fact that $p_b$ was equal to
the observed ratio x/n; therefore, for any hypothesis of $n_b$ attributed
to the background, $x_b = p_b\, n_b$ counts are on average 'subtracted' from
x (this is properly done in an automatic way in the Bayes formula,
followed by marginalization).

• The situation gets more interesting when $p_b$ differs from x/n.

The middle plot in the figure is for $p_b$ = 0.25. Again, the case $\lambda_b$ = 0
gives the pdf we already know. But as soon as some background
is hypothesized, the curves start to drift to the right side. That is
because high background with low $p_b$ favors large values of $p_s$.

The opposite happens if we think that background is characterized by
large $p_b$, as shown in the bottom plot of the figure.

5.2 Inferring $n_s$ and $\lambda_s$

The histograms of Fig. 11 show examples of the probability distributions
of $n_s$ for $\lambda_b$ = 4 and three different hypotheses for $p_b$. These distributions
quantify how much we believe that $n_s$ out of the observed n belong to the
signal. [By the way, the number $n_b$ of background objects present in the
data can be inferred as complement to $n_s$, since the two numbers are linearly
dependent. It follows that $f(n_b \mid x, n, \lambda_b, p_b) = f(n - n_s \mid x, n, \lambda_b, p_b)$.]

Figure 10: Inference about $p_s$ for n = 12 and x = 9, depending on the expected
background [$\lambda_b$ = 0, 1, 2, 4, 6, 8, 10, 14, as (possibly) indicated by the number
above the lines]. The three plots are obtained by three different hypotheses of $p_b$.

Figure 11: Inference about $n_s$ (histograms) and $\lambda_s$ (continuous lines) for n = 12
and x = 9, assuming $\lambda_b$ = 4 and three values of $p_b$: 0.75, 0.25 and 0.95 (top down).

A different question is to infer the Poisson parameter $\lambda_s$ of the signal. Using
once more Bayes theorem we get, under the hypothesis of $n_s$ signal objects:

$f(\lambda_s \mid n_s) \propto f(n_s \mid \mathcal{P}_{\lambda_s})\; f_0(\lambda_s)$.   (45)

Assuming a uniform prior for $\lambda_s$ we get (see e.g. Ref. [?]):

$f(\lambda_s \mid n_s) = \frac{e^{-\lambda_s}\, \lambda_s^{n_s}}{n_s!}$,   (46)

with expected value and variance both equal to $n_s + 1$ and mode equal to
$n_s$ (the expected value is shifted on the right side of the mode because the
distribution is skewed to the right). Figure [?] shows these pdf's, for $n_s$
ranging from 0 to 12 and assuming a uniform prior for $\lambda_s$.

As far as the pdf of $\lambda_s$ that depends on all possible values of $n_s$, each with its
probability, is concerned, we get from probability theory [and remembering
that, indeed, $f(\lambda_s \mid n_s, x, n, \lambda_b, p_b)$ is equal to $f(\lambda_s \mid n_s)$, because $n_s$ depends
only on $\lambda_s$, and then the other way around]:

$f(\lambda_s \mid x, n, \lambda_b, p_b) \propto \sum_{n_s} f(\lambda_s \mid n_s)\; f(n_s \mid x, n, \lambda_b, p_b)$,   (47)

i.e. the pdf of $\lambda_s$ is the weighted average^ of the several $n_s$-dependent pdf's.

The results for the example we are considering in this section are given
in the plots of Fig. 11.
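
A minimal Python sketch of this last step is given below. It assumes the conditional pdf with a uniform prior, i.e. exp(−λ_s) λ_s^(n_s) / n_s!, and averages it over a table of probabilities for n_s. The table `ns_probs` used here is purely hypothetical and only serves to exercise the functions; in a real application it would come from the inference of n_s described above. The function names are illustrative as well.

```python
from math import exp, factorial

def lambda_pdf_given_ns(lam, n_s):
    """pdf of lambda_s for exactly n_s signal objects, with a uniform prior."""
    return lam**n_s * exp(-lam) / factorial(n_s)

def lambda_pdf(lam, ns_probs):
    """Average of the conditional pdf's, weighted by the probabilities of n_s."""
    return sum(prob * lambda_pdf_given_ns(lam, n_s) for n_s, prob in ns_probs.items())

def lambda_mean(ns_probs):
    """Expected value of lambda_s: weighted average of the conditional means n_s + 1."""
    return sum(prob * (n_s + 1) for n_s, prob in ns_probs.items())

# Hypothetical probabilities of n_s, normalized to 1 (illustrative numbers only).
ns_probs = {7: 0.2, 8: 0.5, 9: 0.3}

step = 0.01
lam_grid = [i * step for i in range(8000)]           # 0 <= lambda_s < 80
norm = sum(lambda_pdf(lam, ns_probs) for lam in lam_grid) * step
mean_numeric = sum(lam * lambda_pdf(lam, ns_probs) for lam in lam_grid) * step
print(f"normalization = {norm:.4f}, E(lambda_s) = {mean_numeric:.3f} "
      f"(from conditional means: {lambda_mean(ns_probs):.3f})")
```

The numerical expected value agrees with the weighted average of the conditional expected values, in line with the remark in the footnote.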

6 Conclusions 

The classical inverse problem related to the binomial distribution has been 
reviewed and extended to the presence of background either only on the num- 
ber of 'successes', or on the trials themselves. The probabilistic approach 
followed here allows to treat the problems only using probability rules. The 
results are always in qualitative agreement with intuition, are consistent 
with observations and prior knowledge and, never lead to absurdities, like p 
outside the range and 1. 

The role of the priors, that are crucial to allow the probabilistic inversion 
and very useful to balance in the proper way prior knowledge and evidence 
from new observations, has been also emphasized, showing when they can 
be neglected and when they are so critical that it is preferable not to provide 
probabilistic conclusions. 

It is a pleasure to thank Stefano Andreon for several stimulating discus- 
sions on the subject. 



^It follows that all moments of the distribution are weighted averages of the moments
of the conditional distribution. Then, expected value and variance of $\lambda_s$ can be easily
obtained from the conditional expected values and variances:

E($\lambda_s$) ∝